Classifying informative and imaginative prose using complex networks
Authors
Abstract
Statistical methods have been widely employed in recent years to capture many properties of language. The application of such techniques has improved several linguistic applications, including machine translation, automatic summarization and document classification. In the latter, many approaches have emphasized the semantic content of texts, as is the case with bag-of-words language models. This approach has certainly yielded reasonable performance. However, some potentially useful features, such as the structural organization of texts, have been used in only a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aimed at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than that of similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterizing texts.
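To make the idea of describing function words by their local network properties more concrete, here is a minimal sketch. It rests on several assumptions: it links adjacent words in a text to form an undirected network, and it uses node degree and the local clustering coefficient as stand-in measurements. The symmetry and accessibility measurements named in the abstract are not implemented here, and the `FUNCTION_WORDS` list and all function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical, tiny function-word list used only for illustration.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def build_adjacency_network(text):
    """Link every pair of adjacent words; returns an undirected neighbor map."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    neighbors = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # skip self-loops from repeated words
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def local_clustering(neighbors, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    nbrs = sorted(neighbors.get(node, ()))
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def feature_vector(text, function_words=FUNCTION_WORDS):
    """Concatenate (degree, clustering) for each function word, in fixed order."""
    neighbors = build_adjacency_network(text)
    features = []
    for word in function_words:
        features.append(float(len(neighbors.get(word, ()))))
        features.append(local_clustering(neighbors, word))
    return features
```

Each document then maps to a fixed-length vector (two values per function word), which could be passed to any supervised classifier trained on documents labeled as informative or imaginative.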
Similar resources
Computational Analysis Of Predicational Structures In English
The results of a computational analysis of all predications, finite and non-finite, in a one-million-word corpus of present-day American English (the "Brown Corpus") are presented. The analysis shows the nature of the syntactic differences among the various genres of writing represented in the data base, especially between informative prose and imaginative prose. The results also demonstrate th...
SUC-CORE: SUC 2.0 Annotated with NP Coreference
SUC-CORE is a subset of Stockholm Umeå Corpus 2.0 and Swedish Treebank, annotated with noun phrase coreference. While most coreference annotated corpora consist of texts of similar types within related domains, SUC-CORE consists of both informative and imaginative prose and covers a wide range of literary genres and domains.
SUC-CORE: A Balanced Corpus Annotated with Noun Phrase Coreference
This paper describes SUC-CORE, a subset of the Stockholm Umeå Corpus and the Swedish Treebank annotated with noun phrase coreference. While most coreference annotated corpora consist of texts of similar types within related domains, SUC-CORE consists of both informative and imaginative prose and covers a wide range of literary genres and domains. This allows for exploration of coreference acros...
Grammatical word class variation within the British National Corpus Sampler
This paper examines the relationship between part-of-speech frequencies and text typology in the British National Corpus Sampler. Four pairwise comparisons of part-of-speech frequencies were made: written language vs. spoken language; informative writing vs. imaginative writing; conversational speech vs. ‘task-oriented’ speech; and imaginative writing vs. ‘task-oriented’ speech. The following v...
Comments on Nonfinite Adverbial Patterns in English Prose Fiction: A Simple Model for Analysis and Use
This study aims to present an accessible model of some frequent nonfinite adverbial types occurring in English prose fiction. As its main syntactic argument, it recognizes that these adverbials are mostly elliptical in that there are some dependent-clause markers one can assume to be implicit when supplying those elements back into the clause complex. Some comments are provided at the end on th...
Journal title:
- CoRR
Volume abs/1507.07826 Issue
Pages -
Publication date 2015